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High-level scripting languages are in many ways polar opposites to GPUs. 
GPUs are highly parallel, subject to hardware subtleties, and designed for 
maximum throughput, and they offer a tremendous advance in the perfor- 
mance achievable for a significant number of computational problems. On 
the other hand, scripting languages such as Python favor ease of use over 
computational speed and do not generally emphasize parallelism. PyCUDA 
is a package that attempts to join the two together. This chapter argues 
that in doing so, a programming environment is created that is greater than 
just the sum of its two parts. 

We would like to note that nearly all of this chapter applies in unmodified 
form to PyOpenCL, a sister project of PyCUDA, whose goal it is to realize 
the same concepts as PyCUDA for OpenCL. 
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1 Introduction, Problem Statement, and Context 

How are computational codes created? Their life cycle often begins with a 
quick proof of concept in a high-level language such as MATLAB or Python 
that aims to examine the suitability of the proposed method for the appli- 
cation problem at hand. Once this is established, the problem size is scaled 
up, and with it, the demands for execution speed grow. In principle, it is not 
desirable that the working proof-of-concept code would need to be changed 
merely to squeeze out better computational performance. Unfortunately, at 
present, very few workloads can be scaled in this fashion, so eventually the 
need arises to employ a language that produces efficient machine code. 

The desire to make this transition as seamless as possible quite naturally 
leads to hybrid code, in which only those pieces requiring improved per- 
formance are changed, and the remainder is kept as-is. These approaches 
are not new (cf. e.g. MATLAB's Mex or Python's F2Py). Unfortunately, 
they are not as wide-spread as they could be because they come with a 
significant complexity burden. GPU computing adds a new facet to this 
issue, as GPU programs are always hybrids, even though NVIDIA's CUDA 
Runtime programming interface goes to considerable lengths to paper over 
this fact. GPU programs are thus very naturally split between performance- 
hungry and performance-indifferent parts, usually along the same lines as 
the CPU-GPU boundary. This is fortunate, as one may easily substitute the 
performance-indifferent part of the hybrid, taking advantage of the already 
well-defined interface between the two. PyCUDA fits exactly into this niche 
and allows codes written in this high-level language to obtain GPU perfor- 
mance from within existing Python programs with a minimum of effort. 

But in addition to a practical route to high performance for existing codes, 
PyCUDA also has much to offer to the seasoned GPU programmer creating 
full-scale, production codes. M a,ny of these advanced benefits are described 



in detail in a recent manuscript iKlockner et al.l 2009l | , which focuses on the 



capability of generating GPU code at run time and represents a more aca- 
demic discussion of the software engineering aspects of GPU programming 
and what PyCUDA does to address them. This chapter on the other hand 
provides a hands-on perspective of the things one can do with PyCUDA, 
and how. 



2 Core Method 

A common lament in the field of scientific computing concerns the ever- 
widening gap between hypothetical machine capabilities and the effort an 
individual programmer is able to spend to move a computational application 
towards exploiting a machine to the full extent of its capability. GPUs 
obviously influence this balance, by increasing the maximum capability, by 
increasing the performance that is obtained with an "average" amount of 
effort, and by requiring a different set of skills to achieve genuinely high 
performance. 

More sophisticated tools, compilers and libraries are generally hoped to level 
this field, enabling users to achieve good results even with modest invest- 
ment. PyCUDA is a contribution to this discussion of tools for GPU com- 
puting. PyCUDA (and likewise its sister project PyOpenCL) has a two-fold 
aim. First, it aims to simplify the usage of existing basic concepts of CUDA 
C. Importantly, it does not attempt to change or reinvent the basic notions 
of GPU programming, but instead, as a foundation for further tools, just 
exposes them as-is through a carefully engineered interface. Key features of 
this interface include generous use of sensible defaults (all of which can be 
overridden), automatic error checking, and resource management. Second, 
and strictly on top of the first, basic layer, PyCUDA provides abstractions 
that support a numb er of very common usage patterns involving a GPU 



equivalent of NumPy [Oliphantl . |2006| | arrays. 



As a particular consequence of the design choice to leave as much of the un- 
derlying concepts in place, device kernels occur in PyCUDA programs and 
PyOpenCL as simple strings. This is practical because Python allows so- 
called triple-quoted strings which may extend across line boundaries, thus 
appearing as contiguous code blocks without intervening string terminators 
or other traces of the host-level language. A different approach is pursued by 
the Clyther projecl|j and similar efforts, which let the user write kernels in 
the host language and then translate them to the device language (C/C-l— 1-) 
in some manner. Each approach represents one end of a trade-off. It is 
obviously appealing to let the user program in one single language (the host 
language) rather than forcing her to master two separate ones. PyCUDA 
and PyOpenCL sacrifice this advantage in favor of passing on the concepts 
of the underlying programming system in as unmodified a manner as pos- 
sible in order to provide direct applicability of existing CUDA/OpenCL 



^http : //clyther . sourcef orge . net/ | 



documentation and to decrease the maintenance burden as the underlying 
system evolves. One additional factor entering the trade-off is the ease with 
which kernel code can be automatically generated (see Section IHTSj) . 
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Figure 1: Workflow of PyCUDA GPU program compilation. PyCUDA aims 
to maintain a scripting-like "edit-run-repeat" style of working for the user. 
The compilation and caching operations in the gray box are performed with- 
out user involvement. 

Programming in scripting languages (like Python) tends to be quite satis- 
fying to the programmer because of near-immediate response during devel- 
opment, the possibility of interactive exploration, and good error reporting. 
Although PyCUDA gives the user access to a compiled language (CUDA C), 
it attempts to avoid the development iteration penalty commonly associated 
with compiled languages, instead retaining the satisfaction and immediacy 
of scripting. First of all, it makes invocations to the CUDA C compiler fast 
and transparent. To this end, it employs a compiler caching mechanism 
that is pictured in Figure [TJ As it is likely that only one GPU kernel is 
being changed in each development iteration, the user only needs to wait for 
compilation of this one kernel. The loading of all other kernels in the code 
will be near-instantaneous thanks to PyCUDA's caching mechanism. 

We have thus completed an initial description of PyCUDA's goals and usage 
patterns. The next section will, by means of a number of concrete examples 
of increasing complexity, show how common tasks can be accomplished in 
PyCUDA. Section H] will then briefly reflect on what was shown, and we 
close in Section [6] with a few remarks and ideas for future work. 



3 Algorithms, Implementations, and Evaluations 

In this section, we will take the reader on a brief journey through differ- 
ent aspects of programming GPUs using PyCUDA, starting with a basic 
hello-world example in Section 13. 1^ through more advanced aspects in the 
following sections. 

3.1 The Basics of GPU programming with PyCUDA 

Perhaps the simplest useful program that can be written using PyCUDA is 
shown in Listing [H which we will discuss here step-by-step. 

PyCUDA's interface to the 'nuts and bolts' of the CUDA programming 
system can be found in pycuda. driver and is imported here under the 
alias cuda. The module is called driver because it exposes the so-called 
"driver-level" programming interface of CUDA, which is more flexible than 
the more commonly used CUDA C "runtime-level" programming interface, 
and it has a few features that are not present in the runtime. (If a program 
uses the triple-angle-bracket syntax for kernel invocation, it is using the 
runtime interface.) 

The next imported module, pycuda. autoinit, automatically picks a GPU 
to run on, based on availability and the number, if any, to which the 
CUDAJDEVICE environment variable is set. It also creates a GPU context for 
subsequent code to run in. Both the chosen device and the created context 
are available from pycuda. autoinit as importable symbols, if needed. The 
use of pycuda. autoinit is not compulsory-if needed, users can construct 
their own device-choice and context-creation methods using the facilities 
available in pycuda. driver. 

The last import in this simple example is numpy, the Python array pack- 
age. Since most GPU computations involve large arrays of data, PyCUDA 
integrates tightly with numpy. 

After creating a 4 x 4 array of random single-precision floating point num- 
bers on the CPU in the numpy array identified by the variable a, memory 
for a is allocated on the GPU as a_gpu. The object returned here is a 
DeviceAllocation object, whose lifetime is coupled to that of the GPU 
memory allocation, i.e. once the last reference to the DeviceAllocation 
disappears, the object becomes eligible for garbage collection, and once that 
happens, the GPU allocation also disappears. DeviceAllocation objects 




import pycuda. driver as cuda 
import pycuda. autoinit 
import numpy 

a = numpy. random. randn(4, 4) \ 
. astype(numpy.float32) 
a_gpu = cuda.mem_alloc(a.nbytes) 
cuda.memcpy_htod(a_gpu, a) 

mod = cuda.SourceModule(""" 

—global— void multiply2( float *a) 

{ 

int i = threadldx.x + threadldx.y*4; 

I a[i] *= 2; 
1 Compute Kernel 

) 

func = mod.get_function("multiply2") 
func(a_gpu, block=(4,4,l)) 

a_doubled — numpy. empty_like(a) 
cuda. memcpy_dtoh(a_dou bled, a_gpu) 
print a_doubled 
print a 



Listing 1: An example of the use of PyCUDA, showing the use of the 
SourceModule facihty. This simple program uploads a 4 x 4 array of single- 
precision floating point numbers, multiplies them by two on the GPU, and 
retrieves the result. 

can also be cast to integer for pointer arithmetic. Once memory is allocated, 
the contents of a are copied from the host to the device (hence "htod"). Ob- 
serve that no explicit error checking occurs at any stage of this program. If 
an error is encountered, a subclass of pycuda. driver .Error is raised as an 
exception. 

Now that the data is prepared, the most interesting part of the program 
begins, in which the source code for the GPU kernel is passed to the con- 
structor of SourceModule. At this point, the NVIDIA compiler is invoked, 
coupled with the caching mechanism as described in Section [21 at the end 
of which the user obtains a handle to a driver .SourceModule representing 



the binary code uploaded to the GPU. From this module handle, the user 
may then obtain a Kernel handle by means of the get_f unction method. 
Observe that C++ name manglingj would generally make the symbol names 
passed to get_f unction complicated and non-human-readable. For this rea- 
son, PyCUDA automatically wraps the code passed to SourceModule in an 
extern "C" declaration. If the use of C++ features is desired, the keyword 
argument no_extern_c may be passed to SourceModule to avoid this. In 
this case, any __global__ entry points should be declared extern "C" by 
hand. 

The Kernel handle, once it has been obtained, can then be called like any 
other function. Keyword arguments grid and block determine the size of 
the computational grid and the thread block size. DeviceAllocation in- 
stances may be passed directly to kernels, but other arguments incur the 
problem that PyCUDA knows nothing about their required type. There 
are two ways to address this: First, pass all such arguments as numpy sized 
scalars, such as numpy.f loat32(5.7) for single-precision floating point, or 
numpy. intp(p) for pointer-sized integers. Alternatively, one may use pre- 
pared kernel invocation, in which the user informs PyCUDA explicitly about 
the kernel's argument types. In this case, the invocation line would need to 
be changed as follows: 

func. prepare(" P" , block=(4,4,l)) 
func. prepared_call ((1, 1), a_gpu) 

Note that the prepare method only needs to be called once. Its first ar- 
gument represents the kernel's arguments as a type string as accepted by 
the struct module in Python standard library, e.g. "P" in the example 
specifies a single pointer argument. The prepared call styles also differs in 
that the thread block size is set at prepare time (but can be changed), and 
the grid dimension may be set with each call. Prepared kernel invocation is 
also slightly faster than the explicitly-sized invocation style. 

In this example, the function of the GPU code itself is trivial-for a 4 x 4 
block size and a single-element grid, each entry of the corresponding array 
on the GPU is multiplied by two. As the kernel is invoked, the kernel call 
enters the queue on the GPU, but, like in the CUDA "runtime" interface, 
the invocation returns immediately and does not wait for completion on the 



^"Name mangling" facilitates function overloading in C++ and represents the encoding 
of signature information in the symbol name used for a function. Mangling methods vary 
by compiler and operating system ABI. 



GPU. An argument stream may be passed in either calling style to specify 
a Stream in which execution is to take place. 

At the end of this simple example, memory is allocated on the CPU and 
results are transferred back, to be printed alongside the original array. 
Again, like in the CUDA runtime interface, all memcpy_* functions enqueue 
the transfer and wait until it completes. If this is not desired, separate 
memcpy_*_async functions exist. 

3.2 What Comes in the Box 

After the introduction to the basic method of programming CPUs with 
PyCUDA, this section seeks to make the reader aware of further built-in 
facilities aimed at making GPU programming easier. 

3.2.1 GPU Arrays 



import pycuda.autoinit 


import pycuda.gpuarray as gpuarray 


import numpy 


a.gpu = gpuarray.to_gpu( 


numpy.random.randn(4,4) 


.astype(numpy.float32)) 


a.doubled — (2>i<a_gpu).get() 


print a.doubled 


print a_gpu 



Listing 2: An example performing the same function as Listing [T], but using 
GPUArrays. 

The example presented above, as an element-wise vector operation, repre- 
sents not only an easy introduction, but also a common use case of GPU com- 
putations, perhaps in the form of an auxiliary step between other calcula- 
tions. For this reason, PyCUDA supplies an array object, pycuda . gpuarray . GPUArray 
that shrinks the code of the example of Listing [1] to that of Listing [2j Just 
like numpy arrays, but unlike memory allocated using DeviceAllocation, 
GPUArrays know about their shapes and data types. They support all arith- 
metic operators and a number of methods and functions, all patterned af- 
ter the corresponding functionality in numpy. In addition, many special 



functions are available in pycuda . cumath. Arrays of approximately uni- 
formly distributed random numbers may be generated using functionality 
in pycuda. curandom. 

3.2.2 Complex Numbers on the GPU 

In addition to CUDA's built-in support for real numbers, PyCUDA adds 
seamless support for complex numbers. To enable this support, add the line 
"#include <pycuda-complex.hpp>" to your kernel and declare complex 
numbers as pycuda: :complex<type>, where type may be, e.g., double or 
float. PyCUDA's GPUArrays also natively support operations on complex 
data types. 

3.2.3 Double Precision Textures 

Another area where PyCUDA improves on native CUDA is support for 
double precision in texture fetches. This is easiest to achieve when binding 
a GPUArray to a texture reference using the GPUArray's bind_to_texref _ext 
method, while specifying the allow_double_hack keyword argument as true. 
Within kernel code, one may then use the following pattern for texture 
access: 

:^include <pycuda — helpers. hpp> 

texture <fp_tex_double, 1, cudaReadModeElementType> my_tex; 

__globaL_ void f() 

{ 
double X — fp_texlDfetch(my_tex, threadldx.x); 

} 

Observe in particular that the texture type was prefixed with fp_tex_ and 
the fetching function was prefixed with f p_. 

3.2.4 Efficient Evaluation of Element-wise Expressions 

Evaluating deeply nested expressions on GPUArray instances can be in- 
efficient, because a new temporary is created for each intermediate re- 
sult. Many programming languages offering user-definable abstract vec- 
tor types have this problem, for example C++. C++ also offers a way 
of dealing with this particular issue in the form of expression templates 



import pycuda.gpuarray as gpuarray 
import pycuda.autoinit 
from pycuda.curandom import rand \ 
as curand 



a-gpu 
b-gpu 



curand((50,)) 
curand((50,)) 



from pycuda.elementwise import \ 

ElementwiseKernel 
lin_comb — ElementwiseKernel ( 

" float a, float *x, 

" float b, float *y, float *z" , 

"z[i] = a*x[i] + b*y[i]" , 

" linear_combination ") 

c_gpu = gpuarray. emptyJike(a_gpu) 
lin_comb(5, a_gpu, 6, b_gpu, c_gpu) 

(a) An example of the use of the generic fa- 
cihty for element-wise operations. 



import pycuda.gpuarray as gpuarray 
import pycuda.autoinit 
import numpy 

a — gpuarray. arange( 

400, dtype=numpy.float32) 
b = gpuarray. arange( 

400, dtype=numpy.float32) 

from pycuda. reduction import \ 
ReductionKernel 

krni = ReductionKernel( 
numpy. float32, 

arguments^" float *x, float *y" 
map_expr="x[i]*y[i ]" , 
reduce_expr=" a+b" , neutral=" 0" ) 

my_dot_prod = krnl(a, b).get() 

(b) An example of the use of the generic fa- 
cility for reductive ("folding") operations. 



Listing 3: Examples of high-level primitives for working with GPUArrays. 



Veldhuizen and Jerniganl . Il997l |. While we acknowledge that this is cer- 
tainly a matter of taste, we strongly prefer the striking simplicity of both 



use and implementation of PyCUDA's facility, portrayed in Listing 3(a) 
over the significant complexity of expression templates. 

The functionality in the module pycuda. elementwise contains tools to help 
generate and invoke kernels that evaluate complicated expressions on one or 
several operands in a single pass. The instrumental part of the example is 
the invocation of the ElementwiseKernel constructor, in which the user pro- 
vides both a C-style argument list and a statement (or semicolon-separated 
list of statements) to be executed for each value of i, which is used as a 
formal index variable running across each element of GPUArray instances 
passed to the ElementwiseKernel. All these instances are required to have 
the same length. 



10 



3.2.5 Map-Reduce 

In a similar spirit as the support for element-wise operations, PyCUDA has 
built-in functionality for tree-based reductions on the GPU. To make this 
even more useful, evaluating an element-wise expression ahead of reduction 
is also supported. The facil ity thereby becomes a simpl e implementation of 



the MapReduce procedure [Dean and Ghemawatl . l2008l | 



Listing [3(b) I shows the implementation of a dot product through an instanti- 
ation of the ReductionKernel class. The constructor signature starts with 
the specification of the result dtype, in this case numpy.f loat32. It then 
proceeds as the ElementwiseKernel above by allowing a C argument signa- 
ture and an arbitrary C expression on whose results the final reduction will 
be performed. As above, the formal variable i represents the index from 
which the input element should be read. 

The reduction step is then specified by two further arguments, a reduction 
expression of two formal arguments a and b and an expression resulting in a 
neutral element with respect to the reduction expressiono Once all of this 
information is specified, the resulting ReductionKernel may be called as 
above to perform the reduction. 

Observe that the end result of calling the ReductionKernel instance is a 
GPUArray scalar still residing on the GPU. It can be brought to the CPU 
by a call to its get method, or used in-place on the GPU. 

3.2.6 Further Facilities 

In addition to elementwise operations and reductions, versions 2011.1 and 
newer of PyCUDA and PyOpenCL are able to assist the user with the 
implementation of CPU-based parallel prefix sums (also known as 'parallel 
scan'). Further, they provide a number of tools to the GPU implementer 
which we will mention here, but for whose detailed description we refer to 
the reference documentation. 

The first of those is a key optimization for programs that allocate and deal- 
locate CPU memory at a rapid rate. Since CUDA's memory allocation 
functions are relatively expensive operations, it becomes expedient to retain 



^"Neutral element" is mathematical terminology for an element that turns a binary 
operator into an identity map. For example, zero is the neutral element of addition, and 
one is the neutral element of multiplication. 
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Figure 2: Operating principle of GPU code generation. 

already allocated memory in a GPU computing process instead of freeing 
it. These retained blocks of memory may then be reused once a similarly- 
sized block is requested afterwards. PyCUDA supports this through the use 
of memory pools. These pools integrate with GPUArrays, whose arithmetic 
operators are a good example of the need for repeatedly freed and allocated 
memory of recurring sizes. In addition, a memory pool implementation also 
exists for page-locked host memory. 

Lastly, PyCUDA comes with a conjugate-gradient-based Krylov solver for 
large, sparse linear s ystems and an implera entation of sparse matrices on 
the GPU, following |Bell and Garland . l2009l ] . These GPU-based sparse ma- 



trices integra.te dir ectly with sparse matrix support in the SciPy package 



Jones et al.l . l2001-| and provide computati onal performance similar to that 



of NVIDIA's Cusp [Bell and Garlandl . l20ld ] library 



3.3 Code Generation: Benefits and Usage 

As early as 30 years ago, the Lisp community observed that code is data, and 
that using code itself as the object of computation can be greatly beneficial. 
One key benefit of PyCUDA and PyOpenCL is that they make run-time 
code generation ( "RTCG" ) almost trivial. Figure [2] clarifies the workflow 
used in RTCG. 

This section is devoted to describing a number of issues that are commonly 
faced when programming a GPU. Li each case, we point out how a GPU 
RTCG strategy can be used to address these issues. 
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3.3.1 Automated Tuning 

During the creation of a GPU program, it is natural for the programmer to 
come up with a number of variants of a given code, each of which will be 
observed to have certain properties regarding data layout and computation 
speed. The conventional approach to code tuning then calls for the fastest 
variant to survive, while the others will be discarded. This is not necessarily 
a desirable course of action, as information is lost. Instead, it seems more 
appropriate to retain as many of these variants as is practical, assuming that 
they hold at least some promise. Further, each variant may have a number 
of tunable parameters, such as loop lengths, block sizes, etc. Retaining 
variant information permits choosing the best one from a reasonably sized 
pool of candidates in an automated fashion, guided by some metric such 
as execution speed. This is the basic premise of automated tuning, which 
is trivially enabled by GPU RTCG. Further, automated tuning is not just 
enabled by RTCG, it is enabled at the right time-namely at run time- 
when complete information is available. If desired, the re ader may find a 



few i llustrative examples of the use of automated tuning in [Klockner et al. 



20091]. 



3.3.2 The Cost of Flexibility 

Flexibility is commonly seen as a desirable feature of a computer code. 
It should then be realized that flexibility comes at a cost: Constants get 
replaced by variables, formerly fixed loop trip counts become variable, and 
quite generally a compiler has less knowledge available, making its optimizer 
less effective. The process of removing this sort of fiexibility by hard-coding 
such information into the program, therefore, is generally frowned upon. 
However, with the availability of run-time code generation, information can 
be inserted into the source of the program just in time, leading to an optimal 
combination of flexibility and execution speed. 

3.3.3 High-Performance Abstractions 

Nearly all computer programs are built in 'layers', where each individual 
layer solves a certain subproblem and presents a more abstract, 'higher- 
level' interface to higher layers. This is good engineering practice, as it 
allows partitioning a big problem into many smaller ones, and it enables 
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reuse of engineering effort. In some cases, such abstractions can be made 
uneconomical by coding circumstance, namely when customization applies 
to the contents of an inner loop. Many solutions exist to this problem, 
ranging fro m function poin t ers to C++ templates, each with unique dis- 



advantages [Klockner et al.l . |2009| |. Once RTCG is available, this problem 
also disappears as appropriate code can be generated whenever a different 
requirement arises. 

3.3.4 Generating Code 

from mako. template import Template 

tpl — Template(""" 

—global— void add( 

${ type_name } *tgt, 
${ type_name } *opl, 
${ type_name } *op2) 

{ 

int idx = threadldx.x + 

${ thread_block_size } * ${ block_size } 
* blockldx.x; 

% for i in range( block_size ): 

<% offset — i* thread_block_size %> 
tgt[idx + ${ offset }] = 
opl[idx + ${ offset }] 
+ op2[idx + ${ offset }]; 
% endfor 
} ) 

rendered_tpl =tpl.render( 

type_name=" float" , block_size=block_size, 
thread_block_size =thread_block_size ) 

smod — SourceModule(rendered_tpl) 

Listing 4: Run-Time Code Generation (RTCG) with PyCUDA using a 
templating engine. The Example generates a piece of CUDA C from a 
textual template implementing an unrolled version of vector addition, using 
the Mako engine this instance. Full context for the example can be found in 
the PyCUDA source tree as examples/demo jneta_template.py. 
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We now turn to how a user might go about exploiting run-time code genera- 
tion with PyCUDA. Since PyCUDA can natively process CUDA C code, the 
objective is the generation of such code. PyCUDA makes no assumptions 
about the origins of the code it processes, which allows the logic involved in 
the generation to be designed to match the needs of the application. There 
are, however, a few suggested ways of generating code that we have found 
to cover a variety of needs. 

Code generation can (and in many cases should) be seen as a text processing 
task. Since one is not limited in the choice of tools with which to perform 
this sort of generation, code generation typically makes use of existing text 
processing tools. Generation logic itself can thus be simple and generally 
responds favorably to complexity growth. 

Textual keyword replacement. This simple technique performs the equiv- 
alent of search-and-replace on source code. It suffices for a surprisingly 
large range of use cases, such as the substitution of types and constants 
into source code at run time. Its technological reach is increased by 
combining it with C preprocessor macros. Further contributing to its 
attractiveness, Python's standard library can perform keyword substi- 
tution without relying on external software. 

Textual Templating. For code generation applications where control flow 
and conditionals are required, but where all code variants are textually 
related, the use of a so-called templating engine, commonly used for the 
generation of web pages, offers a natural escalation of the capabilities 
of keyword substitution. Many templating engines (and correspond- 
ingly, te niplating lang uages) exist. Listing [4] demonstrates the use of 



the Mak dBaveiJ [201Cl[ | engine for the generation of a simple, partially 



unrolled vector addition code. 

In addition to these methods, the authors' codepy package also supports 
code generation from abstract syntax trees (ASTs), however this use is some- 
what cumbersome and discouraged in all but the most demanding cases. 



4 Evaluation 

While it is difficult to obtain quantifiable data on GPU programmer pro- 
ductivity and how the use of the PyCUDA and PyOpenCL packages affects 
it, one of the main measures of the success of any open-source package is 
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the size and vitality of the community that uses and develops it. 

As such, we believe that the wide existing user base of PyCUDA and Py- 
OpenCL represents compelling evidence that this programming model as 
well as its concrete implementations are a significant improvement in the 
way programmers interact with GPUs, thereby serving as an important step 
toward bringing GPU computing to the mainstream. 



One central point for user collaboration is each package's wiki at http : //wiki . tiker . net/PyCuda 
and http://wiki.tiker.net/PyOpenCL, In a user-editable fashion, each 
functions as a central collection point for installation instructions on a vari- 
ety of operating systems, frequently asked questions, and code examples. 

In addition, a number of packages have been released by community mem- 
bers that build on top of PyCUDA and PyOpenCL, including 

PyFFT (by Bogdan Opanchuk) An FFT package for both PyCUDA and 
PyOpenCL. Also a nice example of cross-CUDA/OpenCL code gener- 
ation, http : //pypi . python. org/pypi/pyf ft, 

Scikits.CUDA (by Lev Givon and collaborators) Offers wrappers of the 
CUBLAS, CUFFT and CULA packages for numerical computation on 



CUDA. http : //pypi . python . org/pypi/scikits . cuda 



Atomic Hedgehog (by Cyrus Omar) Offers a higher-level programming 



interface for PyOpenCL. http://ahh.bitbucket.org/ 



For a listing of project s that use PyCUDA or PyOpenCL in production, see 
Klockner et al.l . l2009l ] and http://wiki.tiker.net/PyCuda/ShowCase 



5 Availability 



PyCUDA is available from http://niathema.tician.de/software/pycuda, 

and PyOpenCL is available from http : //mathema. tician . de/sof tware/pyopencl 

Both are distributed under the liberal MIT open-source software license. 



Full documentation is available online at http : //documen .tician . de/pycuda 



and http : //documen .tician . de/pyopencl respectively, and both pack 



ages include numerous examples and automated tests. The packages support 
all platforms on which Python and CUDA and/or OpenCL are available. 
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6 Future Directions 

PyCUDA and PyOpenCL, as open-source projects, thrive on user feedback, 
particularly feedback regarding limitations, bugs, or missing features that 
users encounter. For example, much of PyCUDA's initial feature set emerged 
in support of an effort to bring discontinuous Galerkin methods onto the 
GPU, as discussed elsewhere in this volume. 

As part of this continuing improvement process, we have identified a num- 
ber of core areas in which we see potential for future work. Much work has 
recently been done to make PyCUDA easier to install. Part of this effort 
was the elimination of the Boost C+-|- library as an explicit dependency, 
which was released as part of PyCUDA version 0.94 and PyOpenCL 0.92 in 
September of 2010. We are also working on easing inte gration with exist- 



ing C UDA tools such as CUBLAS, CUFFT and Thrust [H oberock and Bell 



2010l |. Integration and ease of migration between PyCUDA and PyOpenCL 
is another key feature that we would like to facilitate. We are further 
planning to improve GPUArray's capability of dealing with slices of multi- 
dimensional arrays. Finally, we are planning on improving support for auto- 
mated tuning in PyCUDA and PyOpenCL by providing search algorithms 
on top of user-supplied search space descriptions and speeding up explo- 
ration of compilation-bound searches by exploiting all available CPU cores. 

We hope that this chapter has managed to encourage you, the reader, to 
try what we believe is a very productive, full-featured GPU programming 
environment. We look forward to your questions and comments on our 
mailing lists. 
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